7BUIS008W Data Mining & Machine Learning - Coursework 1

Andrew Keats

22 November 2017


Question 1: White Wine clustering

Starting off

You need to conduct the k-means clustering analysis of the white wine sheet. Find the ideal number of clusters (please justify your answer). Choose the best two candidate numbers of clusters and perform the k-means algorithm for both. Validate which clustering is more accurate. For the winning clustering, get the mean of each attribute of each group. Before conducting the k-means, please investigate whether you need to add any pre-processing task to your code (justify your answer). Write code in R Studio to address all the above issues. In your report, check the consistency of the produced clusters with the information in column 12.

In the White Wine dataset provided, column 12 is labelled Quality; this is a qualitative value assigned by a human through the subjective means of tasting. Essentially, by trying to cluster against all variables apart from Quality and then comparing against this variable, we are looking for some correlation between all the other variables in combination and the subjective quality of wine.

Firstly we need to load the data…

#load the packages used throughout this analysis
library(readxl)     #read_excel
library(NbClust)    #NbClust
library(fpc)        #plotcluster
library(flexclust)  #randIndex

#import the Excel spreadsheet WhiteWine dataset
wine.raw <- read_excel("../data/Whitewine.xlsx")

Here’s a glance at the dataset

head(wine.raw)
## # A tibble: 6 x 12
##   `fixed acidity` `volatile acidity` `citric acid` `residual sugar`
##             <dbl>              <dbl>         <dbl>            <dbl>
## 1             7.0               0.27          0.36             20.7
## 2             6.3               0.30          0.34              1.6
## 3             8.1               0.28          0.40              6.9
## 4             7.2               0.23          0.32              8.5
## 5             7.2               0.23          0.32              8.5
## 6             8.1               0.28          0.40              6.9
## # ... with 8 more variables: chlorides <dbl>, `free sulfur dioxide` <dbl>,
## #   `total sulfur dioxide` <dbl>, density <dbl>, pH <dbl>,
## #   sulphates <dbl>, alcohol <dbl>, quality <dbl>
str(wine.raw)
## Classes 'tbl_df', 'tbl' and 'data.frame':    4898 obs. of  12 variables:
##  $ fixed acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free sulfur dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total sulfur dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : num  6 6 6 6 6 6 6 6 6 6 ...

We want to scale the data so that all attributes can be compared more easily. First of all, let’s split our data into two tables: one with all the attributes of the wine, and the other with just the humanly perceived quality.

wine.all_but_q <- wine.raw[1:11]
wine.q <- wine.raw$quality

#Wine properties
str(wine.all_but_q)
## Classes 'tbl_df', 'tbl' and 'data.frame':    4898 obs. of  11 variables:
##  $ fixed acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free sulfur dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total sulfur dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
#Wine quality values
str(wine.q)
##  num [1:4898] 6 6 6 6 6 6 6 6 6 6 ...

Now we scale the data

wine.scaled <- as.data.frame(scale(wine.all_but_q))

#Summary of scaled wine data
summary(wine.scaled)
##  fixed acidity      volatile acidity   citric acid      residual sugar   
##  Min.   :-3.61998   Min.   :-1.9668   Min.   :-2.7615   Min.   :-1.1418  
##  1st Qu.:-0.65743   1st Qu.:-0.6770   1st Qu.:-0.5304   1st Qu.:-0.9250  
##  Median :-0.06492   Median :-0.1810   Median :-0.1173   Median :-0.2349  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.52758   3rd Qu.: 0.4143   3rd Qu.: 0.4612   3rd Qu.: 0.6917  
##  Max.   : 8.70422   Max.   : 8.1528   Max.   :10.9553   Max.   :11.7129  
##    chlorides       free sulfur dioxide total sulfur dioxide
##  Min.   :-1.6831   Min.   :-1.95848    Min.   :-3.0439     
##  1st Qu.:-0.4473   1st Qu.:-0.72370    1st Qu.:-0.7144     
##  Median :-0.1269   Median :-0.07691    Median :-0.1026     
##  Mean   : 0.0000   Mean   : 0.00000    Mean   : 0.0000     
##  3rd Qu.: 0.1935   3rd Qu.: 0.62867    3rd Qu.: 0.6739     
##  Max.   :13.7417   Max.   :14.91679    Max.   : 7.0977     
##     density               pH             sulphates      
##  Min.   :-2.31280   Min.   :-3.10109   Min.   :-2.3645  
##  1st Qu.:-0.77063   1st Qu.:-0.65077   1st Qu.:-0.6996  
##  Median :-0.09608   Median :-0.05475   Median :-0.1739  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.69298   3rd Qu.: 0.60750   3rd Qu.: 0.5271  
##  Max.   :15.02976   Max.   : 4.18365   Max.   : 5.1711  
##     alcohol        
##  Min.   :-2.04309  
##  1st Qu.:-0.82419  
##  Median :-0.09285  
##  Mean   : 0.00000  
##  3rd Qu.: 0.71974  
##  Max.   : 2.99502
boxplot(wine.scaled, main="Looking at the data graphically", xlab="Wine Attributes", ylab="Scaled values")

We can see from these box-plots that some attributes have clear outliers suggesting erroneous data rather than natural extremes. As such, it is worth cleansing the data a little by removing those outliers from the dataset. For example, Alcohol, in the rightmost column, seems to have very clear boundaries, as we’d expect from wine. Compared with that, some other attributes tell a different story. Chlorides has many values in the upper quartile and a large distance between min and max, but on closer inspection there is a gradient that suggests a normal distribution. In contrast, Residual Sugar, Free Sulfur Dioxide and Density not only have relatively large min-to-max distances, but also have uppermost values whose nearest neighbours sit a relatively large distance away.

Below are density line graphs to demonstrate the difference between attributes that don’t seem to have outliers compared to those that do.

plot(density(wine.scaled$`alcohol`))

plot(density(wine.scaled$`chlorides`))

plot(density(wine.scaled$`free sulfur dioxide`))

plot(density(wine.scaled$`density`))

In order to work out which attributes should be considered to have genuine outliers, I’ve gone with a heuristic approach: looking at the distance between the uppermost value of each attribute and its nearest neighbour.

#Create a list to populate with our tail neighbour distances
tail_deltas <- c()

for (attrib in wine.scaled) {
  #get the last two values
  data_tails <- tail(sort(attrib),2)
  #push the delta on to our list 
  tail_deltas <- c(tail_deltas, diff(data_tails))
}

#grab our attribute keys to include in our new table/frame
attributes <- names(wine.scaled)

#make a new data frame pairing each attribute with its tail-neighbour distance
dataframe <- data.frame(attributes = attributes, tail_neighbour_d=tail_deltas)

#get the order for the nearest-neighbour distance, greatest first and descending
neighbour_order <- order(dataframe$tail_neighbour_d, decreasing=TRUE)

#now apply the order to the frame
sorted_attributes_by_neighbour_d <- dataframe[ neighbour_order, ]
sorted_attributes_by_neighbour_d
##              attributes tail_neighbour_d
## 8               density        9.5890647
## 6   free sulfur dioxide        8.3788351
## 4        residual sugar        6.7428254
## 3           citric acid        3.5531375
## 1         fixed acidity        2.8440459
## 5             chlorides        2.0596881
## 7  total sulfur dioxide        1.7294905
## 2      volatile acidity        0.9425113
## 10            sulphates        0.1752452
## 11              alcohol        0.1218897
## 9                    pH        0.0662249

Given the findings, I think we can consider just the top five attributes in the above list as the ones to cleanse for outliers. A lot of sources online warn against arbitrarily getting rid of outliers, because valid information may be lost when what you really want to account for is bad data.

To clarify, the attributes to be processed are:
- density
- free sulfur dioxide
- residual sugar
- citric acid
- fixed acidity

Boxplot has an outlier property that we can use to collect values we might want to remove, so this is the first option we will look at for cleansing the data.

wine.scaled_cleansed_bp <- wine.scaled[ !(wine.scaled$density %in% boxplot(wine.scaled$density, plot=FALSE)$out), ]
wine.scaled_cleansed_bp <- wine.scaled_cleansed_bp[ !(wine.scaled_cleansed_bp$`free sulfur dioxide` %in% boxplot(wine.scaled$`free sulfur dioxide`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp <- wine.scaled_cleansed_bp[ !(wine.scaled_cleansed_bp$`residual sugar` %in% boxplot(wine.scaled_cleansed_bp$`residual sugar`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp <- wine.scaled_cleansed_bp[ !(wine.scaled_cleansed_bp$`citric acid` %in% boxplot(wine.scaled_cleansed_bp$`citric acid`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp <- wine.scaled_cleansed_bp[ !(wine.scaled_cleansed_bp$`fixed acidity` %in% boxplot(wine.scaled_cleansed_bp$`fixed acidity`, plot=FALSE)$out), ]

boxplot(wine.scaled_cleansed_bp, main="Looking at the cleansed data graphically", xlab="Wine Attributes", ylab="Scaled values")

While this new set of data now has no values beyond the outermost quartile ranges, this is arguably too harsh a treatment. An alternative option is to work with multiples of the interquartile range; what I have done is tweak the multiplier of the interquartile range until only the most extreme outliers were discarded. In the end a value of 5 times the IQR worked well, picking off only values at the very tips of the tails.

#Get the top 5 variables with the highest outlier distance
worst_outliers <- head(sorted_attributes_by_neighbour_d$attributes, n=5)

wine.scaled_cleansed_iqr <- wine.scaled

# Create a variable to store the row id's to be removed
iqr_outliers <- c()
quartile_multiplier = 5

# Loop through the list of columns you specified
for(i in worst_outliers){

  # Get the Min/Max values
  max <- quantile(wine.scaled_cleansed_iqr[,i],0.75, na.rm=FALSE) + (IQR(wine.scaled_cleansed_iqr[,i], na.rm=FALSE) * quartile_multiplier )
  min <- quantile(wine.scaled_cleansed_iqr[,i],0.25, na.rm=FALSE) - (IQR(wine.scaled_cleansed_iqr[,i], na.rm=FALSE) * quartile_multiplier )
  
  # Get the id's using which
  idx <- which(wine.scaled_cleansed_iqr[,i] < min | wine.scaled_cleansed_iqr[,i] > max)
  
  # Output the number of outliers in each variable
  #print(paste(i, length(idx), sep=' - removing: '))
  
  # Append the outliers list
  iqr_outliers <- c(iqr_outliers, idx) 
}

# Sorting outliers
iqr_outliers <- sort(iqr_outliers)

# Remove the outliers
wine.scaled_cleansed_iqr <- wine.scaled_cleansed_iqr[-iqr_outliers,]

boxplot(wine.scaled_cleansed_iqr, main="Looking at the IQR cleansed data graphically", xlab="Wine Attributes", ylab="Scalled values") 

Now that the data looks a lot cleaner, it’s time to start working with it to try to find the best clustering. To begin with, NbClust will be used to see if it produces anything useful.

set.seed(1234)

number_of_clusters <- NbClust(wine.scaled_cleansed_iqr,
                 min.nc=2, max.nc=15,
                 method="kmeans")
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 9 proposed 2 as the best number of clusters 
## * 3 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 3 proposed 5 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 1 proposed 13 as the best number of clusters 
## * 3 proposed 14 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

The following table displays the results, recommending potential values for k.

table(number_of_clusters$Best.n[1,])
## 
##  0  2  3  4  5  6  8 13 14 15 
##  2  9  3  2  3  1  1  1  3  1

The bar chart more easily conveys this.

barplot(table(number_of_clusters$Best.n[1,]), 
        xlab="Number of Clusters",
        ylab="Number of Criteria",
        main="Number of Clusters Chosen by 30 Criteria")

From the bar graph above we can see there is a clear leader in terms of the suggested number of clusters: k = 2. There are, however, other values that should be explored for comparison: 3, 5 and 14. To check this result, we can plot the sum of squared errors and look for a pronounced bend in the graph; wherever the most pronounced bend sits is a contender for the value of k.

sse_list <- 0
for (i in 1:15){
  sse_list[i] <- sum(kmeans(wine.scaled_cleansed_iqr, centers=i)$withinss)
}
## Warning: did not converge in 10 iterations
plot(1:15,
  sse_list,
  type="b",
  xlab="Number of Clusters",
  ylab="Within groups sum of squares")

The plot of the sum of squared errors partially backs up the results of NbClust, since there is an ‘elbow’ on the line at 2 on the Number of Clusters axis. Having said that, the kink between 5 and 7 suggests that this range should also be tested for k.

#If we're going to run tests on the k-means against the data we need to remove the outliers from our quality column too
wine.q_cleansed <- wine.q[-iqr_outliers]

set.seed(1234)
fit.km2 <- kmeans(wine.scaled_cleansed_iqr, 2)
fit.km3 <- kmeans(wine.scaled_cleansed_iqr, 3)
fit.km4 <- kmeans(wine.scaled_cleansed_iqr, 4)
fit.km5 <- kmeans(wine.scaled_cleansed_iqr, 5)
fit.km6 <- kmeans(wine.scaled_cleansed_iqr, 6)
fit.km7 <- kmeans(wine.scaled_cleansed_iqr, 7)
fit.km11 <- kmeans(wine.scaled_cleansed_iqr, 11)
fit.km14 <- kmeans(wine.scaled_cleansed_iqr, 14)

plotcluster(wine.scaled_cleansed_iqr, fit.km2$cluster)

plotcluster(wine.scaled_cleansed_iqr, fit.km3$cluster)

plotcluster(wine.scaled_cleansed_iqr, fit.km4$cluster)

plotcluster(wine.scaled_cleansed_iqr, fit.km5$cluster)

plotcluster(wine.scaled_cleansed_iqr, fit.km6$cluster)

plotcluster(wine.scaled_cleansed_iqr, fit.km7$cluster)

plotcluster(wine.scaled_cleansed_iqr, fit.km11$cluster)

plotcluster(wine.scaled_cleansed_iqr, fit.km14$cluster)

Fitting the clusters to the data

Now that we have experimented with various different values of k when applying k-means clustering to the wine data, it’s time to see if a fit can be found that delivers anything obviously meaningful with regard to wine quality. While the strongest cluster option appears to be 2, I thought it worth looking at how to map the data against quality by treating the quality values as factors. As you can see in the following table, there are only 7 unique values out of 10 possible quality scores, meaning that the most natural fit is against 7 clusters, one per quality value.

wine.q_table <- table(wine.q_cleansed)
wine.q_table
## wine.q_cleansed
##    3    4    5    6    7    8    9 
##   19  163 1456 2190  880  175    5
barplot(wine.q_table,
        xlab="Quality values",
        ylab="Frequency",
        main="Distribution of wines across quality values")

#confuseTable.km2 <- table(wine.q, fit.km2$cluster)
#confuseTable.km5 <- table(wine.q, fit.km5$cluster)
#confuseTable.km6 <- table(wine.q, fit.km6$cluster)
confuseTable.km7 <- table(wine.q_cleansed, fit.km7$cluster)
#confuseTable.km14 <- table(wine.q, fit.km14$cluster)

names(dimnames(confuseTable.km7)) <- list("Quality", "Clusters")

#confuseTable.km2
#confuseTable.km5
#confuseTable.km6
confuseTable.km7
##        Clusters
## Quality   1   2   3   4   5   6   7
##       3   4   2   4   1   2   3   3
##       4  45  20  18   3  53   9  15
##       5 272 185 333  47 233 328  58
##       6 458 499 323  45 103 373 389
##       7 162 219  50   2  15  82 350
##       8  29  38   5   2   3  21  77
##       9   1   0   0   0   0   0   4
#confuseTable.km14

randIndex(confuseTable.km7)
##        ARI 
## 0.03410079
#this looks rubbish, let's try creating a factor of categories poor, okay, good and see if that can be matched up to a clustering of 3 (https://www.r-bloggers.com/from-continuous-to-categorical/)
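The randIndex() value above is the adjusted Rand index (ARI): a measure of agreement between two partitions, corrected for chance, where 1 is perfect agreement and values near 0 are no better than random labelling. As a sketch of what is being computed, here is a minimal base-R version run on toy labels (hypothetical data, not the wine set):

```r
#minimal base-R version of the adjusted Rand index for a confusion table
adj_rand <- function(tab) {
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))              #pairs grouped together in both partitions
  sum_a  <- sum(choose(rowSums(tab), 2))     #pairs grouped together in partition A
  sum_b  <- sum(choose(colSums(tab), 2))     #pairs grouped together in partition B
  expected  <- sum_a * sum_b / choose(n, 2)  #agreement expected by chance
  max_index <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_index - expected)
}

a <- c(1, 1, 1, 2, 2, 2)  #toy "quality" labels
b <- c(2, 2, 2, 1, 1, 1)  #the same grouping with the label names swapped
adj_rand(table(a, b))     #1: ARI ignores label names, only the grouping matters
```

Because ARI only cares about the grouping, not the label names, it is a fair way to compare arbitrary cluster assignments against the Quality column.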

Poor results

Given that this low value of 0.03410079 is so far from the ideal, and given that, as can be seen from the matrix, the wines are spread across all clusters, we can surmise that either the White Wine dataset was not cleansed thoroughly enough or that k-means clustering simply isn’t an effective way of determining quality.

In order to be sure that it is indeed the methodology that is unsuitable rather than the data being insufficiently processed, looking at a more severe form of data cleansing may prove insightful; to that end, removing all boxplot outliers across all variables and running the whole process again is worth it just to see if the results are more conclusive.

wine.properties <- names(wine.all_but_q)
# 
wine.scaled_cleansed_bp_all <- wine.scaled

wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$density %in% boxplot(wine.scaled_cleansed_bp_all$density, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$`free sulfur dioxide` %in% boxplot(wine.scaled_cleansed_bp_all$`free sulfur dioxide`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$`residual sugar` %in% boxplot(wine.scaled_cleansed_bp_all$`residual sugar`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$`citric acid` %in% boxplot(wine.scaled_cleansed_bp_all$`citric acid`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$`fixed acidity` %in% boxplot(wine.scaled_cleansed_bp_all$`fixed acidity`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$`volatile acidity` %in% boxplot(wine.scaled_cleansed_bp_all$`volatile acidity`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$chlorides %in% boxplot(wine.scaled_cleansed_bp_all$chlorides, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$`total sulfur dioxide` %in% boxplot(wine.scaled_cleansed_bp_all$`total sulfur dioxide`, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$pH %in% boxplot(wine.scaled_cleansed_bp_all$pH, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$sulphates %in% boxplot(wine.scaled_cleansed_bp_all$sulphates, plot=FALSE)$out), ]
wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all$alcohol %in% boxplot(wine.scaled_cleansed_bp_all$alcohol, plot=FALSE)$out), ]

# for (prop in wine.properties) {
#   wine.scaled_cleansed_bp_all <- wine.scaled_cleansed_bp_all[ !(wine.scaled_cleansed_bp_all[prop] %in% boxplot(wine.scaled_cleansed_bp_all[prop], plot=FALSE)$out), ]
# }

boxplot(wine.scaled_cleansed_bp_all, main="Boxplot all outliers cleansed", xlab="Wine Attributes", ylab="Scaled values")

set.seed(1234)

number_of_clusters_severe_cleanse <- NbClust(wine.scaled_cleansed_bp_all,
                 min.nc=2, max.nc=15,
                 method="kmeans")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 11 proposed 2 as the best number of clusters 
## * 5 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## * 4 proposed 14 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

The following table displays the results, recommending potential values for k.

table(number_of_clusters_severe_cleanse$Best.n[1,])
## 
##  0  2  3  4 10 14 15 
##  2 11  5  2  1  4  1

The bar chart more easily conveys this.

barplot(table(number_of_clusters_severe_cleanse$Best.n[1,]), 
        xlab="Number of Clusters",
        ylab="Number of Criteria",
        main="Number of Clusters Chosen by 30 Criteria")

sse_list <- 0
for (i in 1:15){
  sse_list[i] <- sum(kmeans(wine.scaled_cleansed_bp_all, centers=i)$withinss)
}

plot(1:15,
  sse_list,
  type="b",
  xlab="Number of Clusters",
  ylab="Within groups sum of squares")

#If we're going to run tests on the k-means against the severely cleansed data we need to remove the outliers from our quality column too
bp_severe_outliers <- unique(unlist(mapply(function(x, y) sapply(setdiff(x, y), function(d) which(x==d)), wine.scaled, wine.scaled_cleansed_bp_all)))

wine.q_cleansed_severe <- wine.q[-bp_severe_outliers]

set.seed(1234)
fit_severe.km2 <- kmeans(wine.scaled_cleansed_bp_all, 2)
fit_severe.km3 <- kmeans(wine.scaled_cleansed_bp_all, 3)
fit_severe.km4 <- kmeans(wine.scaled_cleansed_bp_all, 4)
fit_severe.km5 <- kmeans(wine.scaled_cleansed_bp_all, 5)
fit_severe.km6 <- kmeans(wine.scaled_cleansed_bp_all, 6)
fit_severe.km7 <- kmeans(wine.scaled_cleansed_bp_all, 7)
fit_severe.km11 <- kmeans(wine.scaled_cleansed_bp_all, 11)
fit_severe.km14 <- kmeans(wine.scaled_cleansed_bp_all, 14)

plotcluster(wine.scaled_cleansed_bp_all, fit_severe.km2$cluster)

plotcluster(wine.scaled_cleansed_bp_all, fit_severe.km3$cluster)

plotcluster(wine.scaled_cleansed_bp_all, fit_severe.km4$cluster)

plotcluster(wine.scaled_cleansed_bp_all, fit_severe.km5$cluster)

plotcluster(wine.scaled_cleansed_bp_all, fit_severe.km6$cluster)

plotcluster(wine.scaled_cleansed_bp_all, fit_severe.km7$cluster)

plotcluster(wine.scaled_cleansed_bp_all, fit_severe.km11$cluster)

plotcluster(wine.scaled_cleansed_bp_all, fit_severe.km14$cluster)

confuseTable_severe.km7 <- table(wine.q_cleansed_severe, fit_severe.km7$cluster)

names(dimnames(confuseTable_severe.km7)) <- list("Quality", "Clusters")

confuseTable_severe.km7
##        Clusters
## Quality   1   2   3   4   5   6   7
##       3   0   1   4   1   2   0   1
##       4   7  10  30   9  16   7   9
##       5  83 289 167  25 178 180 179
##       6 220 307 306 261 342 203 225
##       7 148  39 107 266 125  58  52
##       8  25   6  26  50  13  17   8
##       9   0   0   0   4   0   0   0
randIndex(confuseTable_severe.km7)
##        ARI 
## 0.02536692
#this looks rubbish, let's try creating a factor of categories poor, okay, good and see if that can be matched up to a clustering of 3 (https://www.r-bloggers.com/from-continuous-to-categorical/)

Results so far

Given that the poor value of 0.03410079 for the dataset that was only lightly ‘pruned’ is actually closer to 1 than the 0.02536692 of this last result, the previous dataset led to a better set of clusters to act as predictors of Quality.
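To put the comparison in one place, the two reported ARI values can be tabulated side by side (figures copied from the randIndex() outputs above, so this is a summary rather than a new computation):

```r
#ARI of the 7-cluster fit under each cleansing strategy, as reported above
ari_comparison <- data.frame(
  cleansing = c("5x IQR on worst five attributes (light)",
                "boxplot outliers on all attributes (severe)"),
  ARI = c(0.03410079, 0.02536692),
  stringsAsFactors = FALSE
)

#the strategy with the (marginally) better agreement with Quality
ari_comparison$cleansing[which.max(ari_comparison$ARI)]
```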

Alternative clusters

It would appear that the less intensive data cleansing was more appropriate, if the ARI value is anything to go by. To recall what the confusion matrix for that looked like, it is repeated below:

confuseTable.km7
##        Clusters
## Quality   1   2   3   4   5   6   7
##       3   4   2   4   1   2   3   3
##       4  45  20  18   3  53   9  15
##       5 272 185 333  47 233 328  58
##       6 458 499 323  45 103 373 389
##       7 162 219  50   2  15  82 350
##       8  29  38   5   2   3  21  77
##       9   1   0   0   0   0   0   4
randIndex(confuseTable.km7)
##        ARI 
## 0.03410079
#this looks rubbish, let's try creating a factor of categories poor, okay, good and see if that can be matched up to a clustering of 3 (https://www.r-bloggers.com/from-continuous-to-categorical/)

If we look at this more closely, we can see that while wines of various qualities are spread across all clusters, some clusters are weighted in favour of higher-quality wines or the middle range. From this observation it can be posited that a more meaningful fit might be found between three factors of quality: “Good”, “Mediocre” and “Bad”. So one final experiment before drawing a conclusion is to try to fit the data against 3 clusters.

Creating 3 quality factors & attempting one last fit.

wine.q_cleansed_f3 <- cut(wine.q_cleansed, 3, labels = c("bad", "mediocre", "good"))

confuseTable.km3 <- table(wine.q_cleansed, fit.km3$cluster)
names(dimnames(confuseTable.km3)) <- list("Quality", "Clusters")

confuseTable.km3_f3 <- table(wine.q_cleansed_f3, fit.km3$cluster)
names(dimnames(confuseTable.km3_f3)) <- list("Quality", "Clusters")

confuseTable.km3
##        Clusters
## Quality   1   2   3
##       3   2   7  10
##       4  49  72  42
##       5 287 384 785
##       6 732 668 790
##       7 464 275 141
##       8  92  56  27
##       9   3   2   0
randIndex(confuseTable.km3)
##        ARI 
## 0.03450436
confuseTable.km3_f3
##           Clusters
## Quality       1    2    3
##   bad       338  463  837
##   mediocre 1196  943  931
##   good       95   58   27
randIndex(confuseTable.km3_f3)
##        ARI 
## 0.02540406

These further attempts to fit the data against 3 clusters have not yielded (significantly) better results. The very last thing is to see how the k = 2 suggestion from NbClust works out.

Fitting to NbClust suggested k = 2

wine.q_cleansed_f2 <- cut(wine.q_cleansed, 2, labels = c("bad", "good"))

confuseTable.km2 <- table(wine.q_cleansed, fit.km2$cluster)
names(dimnames(confuseTable.km2)) <- list("Quality", "Clusters")

confuseTable.km2_f2 <- table(wine.q_cleansed_f2, fit.km3$cluster)
names(dimnames(confuseTable.km2_f2)) <- list("Quality", "Clusters")

confuseTable.km2
##        Clusters
## Quality    1    2
##       3    8   11
##       4  109   54
##       5  608  848
##       6 1329  861
##       7  725  155
##       8  147   28
##       9    4    1
randIndex(confuseTable.km2)
##        ARI 
## 0.02553242
confuseTable.km2_f2
##        Clusters
## Quality    1    2    3
##    bad  1070 1131 1627
##    good  559  333  168
randIndex(confuseTable.km2_f2)
##        ARI 
## 0.03430092

Writing up for the best results

According to the ARI values, the highest being 0.03450436, for 3 clusters against the 7 unique quality values, this is the most successful k-means clustering of those explored, though only marginally. So before we reach our conclusion it’s necessary to display the characteristics of this particular set of clusters for k = 3.

#fit.km3

#K-means clustering with 3 clusters of sizes:
fit.km3$size
## [1] 1629 1464 1795
#Cluster means:
fit.km3$centers
##   fixed acidity volatile acidity citric acid residual sugar  chlorides
## 1    -0.7473107     -0.009347355  -0.3723115     -0.5931782 -0.2719605
## 2     0.6708559     -0.039372941   0.1095180     -0.4862211 -0.2011769
## 3     0.1225661      0.034860574   0.2214525      0.9248948  0.4093270
##   free sulfur dioxide total sulfur dioxide    density         pH
## 1          -0.2156091           -0.3803302 -0.6638715  0.7894287
## 2          -0.5527267           -0.5576918 -0.4936294 -0.6102840
## 3           0.6377623            0.7928609  0.9956100 -0.2189245
##     sulphates    alcohol
## 1  0.23134426  0.5336605
## 2 -0.33340767  0.4308508
## 3  0.06153037 -0.8394476

Conclusion

By the looks of it, this use of k-means is simply not appropriate for this dataset; it would seem that either using Principal Component Analysis or applying a method for selecting which variables to include when looking for particular trends would be required to reduce the noise that comes from having so many dimensions. Additionally, I would consider initially testing against a smaller sample of the data next time before investing so many CPU cycles in this task!
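As a pointer for that follow-up, below is a minimal sketch of the PCA route, assuming the same shape of data as wine.scaled_cleansed_iqr (11 scaled numeric attributes). It is demonstrated on synthetic data so that it runs stand-alone; it is not part of the analysis above.

```r
set.seed(1234)

#stand-in for wine.scaled_cleansed_iqr: 500 observations of 11 scaled attributes
toy <- as.data.frame(matrix(rnorm(500 * 11), ncol = 11))

#principal component analysis on the standardised attributes
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

#keep enough components to explain roughly 80% of the variance
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp <- which(var_explained >= 0.8)[1]

#cluster in the reduced space instead of all 11 dimensions
fit.pca_km <- kmeans(pca$x[, 1:n_comp], centers = 2, nstart = 25)
table(fit.pca_km$cluster)
```

Clustering the component scores rather than the raw attributes discards the low-variance directions that contribute mostly noise, which may give the clusters a better chance of lining up with Quality.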
